Amazon.com - Employee Access Challenge

About Dataset

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Amazon.com - Employee Access dataset. The goal is to predict whether or not access should be granted according to an employee's role information and a resource code.

There are 9 independent variables (including id):

Target variable:

Metrics:

What was done in this notebook?

Outline

1. Import Necessary Libraries

2. EDA

2.1. Dataset Overview

The plot suggests an imbalanced dataset, where most requests are approved and only a few are rejected. To address this, techniques like upsampling or downsampling may be needed before building the model.
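A minimal sketch of the downsampling option using `sklearn.utils.resample`; the DataFrame here is a toy stand-in for the real training data, not the actual columns:

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for the training data: ACTION is the imbalanced target.
train = pd.DataFrame({
    "RESOURCE": range(100),
    "ACTION": [1] * 90 + [0] * 10,
})

majority = train[train["ACTION"] == 1]  # approved requests
minority = train[train["ACTION"] == 0]  # rejected requests

# Downsample the approved class to match the number of rejections.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])

# Both classes now have 10 rows.
print(balanced["ACTION"].value_counts().to_dict())
```

Upsampling works the same way with `replace=True` and `n_samples=len(majority)` applied to the minority class.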

2.2. Compute the Linear Relationship Between Two Features

The pair plot shows no correlation among the variables except between ROLE_TITLE and ROLE_CODE, where a clear linear relationship is evident. Since each title has a unique ROLE_CODE, the two variables likely encode the same information.
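One quick way to confirm the suspected one-to-one mapping is to check that each ROLE_TITLE maps to exactly one ROLE_CODE and vice versa; the values below are made up for illustration:

```python
import pandas as pd

# Toy sample: every ROLE_TITLE should map to exactly one ROLE_CODE.
df = pd.DataFrame({
    "ROLE_TITLE": [117905, 117905, 118536, 118536, 119993],
    "ROLE_CODE":  [117908, 117908, 118539, 118539, 119996],
})

# If the mapping is one-to-one in both directions, one of the two
# columns is redundant and can be dropped.
title_to_code = df.groupby("ROLE_TITLE")["ROLE_CODE"].nunique()
code_to_title = df.groupby("ROLE_CODE")["ROLE_TITLE"].nunique()
print((title_to_code == 1).all() and (code_to_title == 1).all())  # True
```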

2.3. Check Consistency of Training and Test Set Distributions

"MGR_ID" and "ROLE_FAMILY_DESC" contain high-proportion categories in the test set that do not appear in the training set.
We should look for a way to align the two distributions, as such inconsistency can hurt modeling performance.


Infrequently occurring categories can be grouped together into a single category.

After grouping, the instances of inconsistency are significantly reduced.
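A sketch of the grouping step, assuming a hypothetical `group_rare` helper and a hand-picked frequency threshold:

```python
import pandas as pd

def group_rare(train_col, test_col, min_count=10, rare_token=-1):
    """Map categories appearing fewer than min_count times in the
    combined train+test column to a single 'rare' token."""
    combined = pd.concat([train_col, test_col])
    counts = combined.value_counts()
    rare = counts[counts < min_count].index
    return (train_col.where(~train_col.isin(rare), rare_token),
            test_col.where(~test_col.isin(rare), rare_token))

train_mgr = pd.Series([1, 1, 1, 2, 3])
test_mgr = pd.Series([1, 4, 4])
tr, te = group_rare(train_mgr, test_mgr, min_count=2)

# MGR_ID values 2 and 3 occur only once overall, so both become -1;
# 1 and 4 are frequent enough and survive unchanged.
print(sorted(tr.unique()))  # [-1, 1]
```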

2.4. Generating New Features

Since the features are randomly encoded and nominal in nature, arithmetic operations such as addition or subtraction would not yield useful new features. However, concatenating features into combined categories is worth exploring.

combination_num = n: use hashing to combine every subset of n features into a new feature.
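One way to sketch this with `itertools.combinations` and pandas' built-in hashing; the helper name and the column values are illustrative, not the notebook's actual implementation:

```python
import itertools
import pandas as pd

def add_combinations(df, cols, n):
    """For every n-column subset, hash the concatenated values into a
    new categorical feature (combination_num = n)."""
    out = df.copy()
    for combo in itertools.combinations(cols, n):
        name = "+".join(combo)
        joined = df[list(combo)].astype(str).agg("_".join, axis=1)
        out[name] = pd.util.hash_pandas_object(joined, index=False)
    return out

df = pd.DataFrame({"RESOURCE": [0, 1], "MGR_ID": [7, 7], "ROLE_FAMILY": [3, 4]})
df2 = add_combinations(df, ["RESOURCE", "MGR_ID", "ROLE_FAMILY"], 2)
print([c for c in df2.columns if "+" in c])
# ['RESOURCE+MGR_ID', 'RESOURCE+ROLE_FAMILY', 'MGR_ID+ROLE_FAMILY']
```

Hashing keeps the new columns numeric regardless of cardinality; the downstream encoders treat them as just another categorical feature.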

2.5. Category Encoding

2.5.1. KFold Target Encoding
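A minimal sketch of out-of-fold target encoding, assuming the target column is named ACTION and using random toy data; each row is encoded with the category mean computed on the other folds, which avoids target leakage:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(train, col, target, n_splits=5, seed=42):
    """Encode col with out-of-fold means of target."""
    encoded = pd.Series(np.nan, index=train.index)
    global_mean = train[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        # Category means computed only on the other folds.
        means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train.iloc[enc_idx][col].map(means).values
    # Categories unseen in a fit fold fall back to the global mean.
    return encoded.fillna(global_mean)

rng = np.random.default_rng(0)
train = pd.DataFrame({"ROLE_DEPTNAME": rng.integers(0, 3, 100),
                      "ACTION": rng.integers(0, 2, 100)})
enc = kfold_target_encode(train, "ROLE_DEPTNAME", "ACTION")
print(enc.between(0, 1).all())  # True
```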

2.5.2. Count Encoding
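Count encoding is simpler: each category is replaced by its frequency in the column. A one-liner sketch:

```python
import pandas as pd

# Count encoding: replace each category with how often it occurs.
s = pd.Series(["a", "b", "a", "c", "a", "b"])
count_encoded = s.map(s.value_counts())
print(count_encoded.tolist())  # [3, 2, 3, 1, 3, 2]
```

In practice the counts are usually computed on train (or train+test combined) and then mapped onto both sets with the same lookup.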

3. Modeling

3.1. Base Model
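A hedged sketch of a baseline evaluation loop; the synthetic imbalanced data stands in for the encoded training set, and ROC AUC is assumed as the competition metric:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic data standing in for the encoded features.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.9], random_state=42)
base = RandomForestClassifier(n_estimators=100, random_state=42)

# Cross-validated AUC gives a leakage-free baseline to beat.
scores = cross_val_score(base, X, y, cv=5, scoring="roc_auc")
print(round(scores.mean(), 3))
```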

3.2. Model Fine-Tuning
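The tuning pattern for each classifier below can be sketched with `RandomizedSearchCV`; the search space here is a small illustrative grid, not the notebook's actual hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Sample a few hyperparameter combinations and keep the best by AUC.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [3, 6, 12, None],
                         "min_samples_leaf": [1, 2, 5]},
    n_iter=5, cv=3, scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_)
```

The same pattern applies to ExtraTrees, LightGBM, XGBoost, and CatBoost by swapping the estimator and its parameter grid.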

3.2.1. RandomForestClassifier

3.2.2. ExtraTreesClassifier

3.2.3. LGBMClassifier

3.2.4. XGBClassifier

3.2.5. CatBoostClassifier

4. Ensemble Stacking
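A minimal stacking sketch with `sklearn.ensemble.StackingClassifier`; the notebook stacks the five tuned models above, but this example uses only scikit-learn estimators on synthetic data to stay self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Base learners produce out-of-fold probabilities; a logistic
# regression meta-learner blends them into the final prediction.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("et", ExtraTreesClassifier(n_estimators=50, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=3, stack_method="predict_proba")
score = cross_val_score(stack, X, y, cv=3, scoring="roc_auc").mean()
print(round(score, 3))
```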